How It Works

In the fast-paced world of modern IT operations, where downtime can cost millions and complex systems span clouds, containers, and microservices, finding the root cause of performance issues has become a holy grail. Traditional AIOps approaches rely heavily on machine learning models crunching traces, logs, and metrics, hoping to "discover" anomalies amid the noise. But let's be honest: how often do these methods deliver in real production environments? We've spoken with many enterprises, and the feedback is consistent: algorithms generalize poorly, explanations are opaque, and results feel superficial. If even seasoned engineers struggle to pinpoint issues without trial and error, how can we expect AI to magically do it better?

The truth is, most faults aren't hidden in KPI fluctuations or statistical patterns. They stem from intricate program behaviors: how threads interact with locks, disks, CPU schedulers, futexes, epoll instances, and sockets. It's not about data mining; it's about deeply understanding runtime mechanisms and observing fault propagation paths. That's where our SRE Agent comes in. Built on eBPF, it goes straight to the kernel to capture thread-level interactions with system resources. No more guessing: we reconstruct the crime scene with precision, using expert rules and proven algorithms to bridge the "last mile" of root cause analysis. Skeptical? We get it. We've faced the "sounds too good to be true" reactions. But we're not hyping vaporware; we're delivering grounded, effective solutions. And now an arXiv paper from May 2025 backs us up: "eBPF-Based Instrumentation for Generalisable Diagnosis of Performance Degradation." This research supports our core philosophy, showing that eBPF-driven insights can diagnose issues across applications without traces or logs, and can do so accurately, explainably, and efficiently.

The Core Challenge: Beyond Surface-Level Metrics

Picture this: your Kafka cluster is lagging, MySQL queries are timing out, or a microservice chain is amplifying delays. Conventional tools might flag high CPU usage or latency spikes, but they rarely explain why. Is it lock contention? Disk bottlenecks? External dependencies? System-level metrics are too coarse: they miss the granular "which thread is waiting on what resource" details. Worse, many diagnostic methods are tied to specific languages, middleware, or logging formats, which limits their portability across diverse stacks.

The paper tackles this head-on by defining two pivotal hurdles:

  • Insufficient Granularity: Aggregate stats obscure thread-specific behaviors.
  • Poor Generalizability: Methods locked to app-layer data can't scale across systems.

Enter a universal, cross-language framework: eBPF instrumentation that profiles "thread behavior portraits" via kernel interactions. Our SRE Agent mirrors this, focusing on the essentials to make root cause analysis actionable and trustworthy.

Building the Foundation: A Robust eBPF Indicator System

At the heart of this approach is a curated set of 16 eBPF metrics across six kernel subsystems, designed to capture how threads engage with critical resources. Here's a snapshot:

| Subsystem | Key Metrics Examples | What It Reveals |
|---|---|---|
| Scheduling | Runtime, RQ time, IOWait time | Time spent on CPU, runqueues, or I/O waits |
| Futex | Futex wait time, Wake count | Lock contention and wakeup frequencies |
| Pipe/Socket | Pipe wait time, Socket wait count | Inter-thread communication delays |
| Epoll | Epoll wait time, Epoll file wait | Async I/O bottlenecks |
| Block I/O | Sector count | Disk pressure or contention |
| VFS/Network | Various wait and access frequencies | Thread-level resource usage views |

These aren't blanket captures—we intelligently target only relevant threads tied to your application, minimizing overhead. No full-system tracing bloat; just focused, low-impact monitoring that keeps your production humming.
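
To make the idea of a per-thread metric record concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the agent's or the paper's actual schema: the `ThreadSample` class, its field names, and the helpers `select_target_samples` and `dominant_wait` are all hypothetical, loosely named after the example metrics in the table above.

```python
from dataclasses import dataclass


@dataclass
class ThreadSample:
    """One sampling interval of kernel-side counters for one thread.

    Hypothetical schema: field names are illustrative, loosely based on
    the example metrics listed in the table above.
    """
    tid: int                   # thread ID
    pid: int                   # owning process ID
    runtime_ns: int = 0        # Scheduling: time on CPU
    runqueue_ns: int = 0       # Scheduling: time waiting on a runqueue
    iowait_ns: int = 0         # Scheduling: time in I/O wait
    futex_wait_ns: int = 0     # Futex: time blocked on a futex
    futex_wake_count: int = 0  # Futex: number of wakeups
    epoll_wait_ns: int = 0     # Epoll: time blocked in epoll
    sectors: int = 0           # Block I/O: sectors transferred


def select_target_samples(samples, target_pids):
    """Keep only samples from threads that belong to the monitored processes."""
    targets = set(target_pids)
    return [s for s in samples if s.pid in targets]


def dominant_wait(sample):
    """Return the name of the wait category with the largest time in this sample."""
    waits = {
        "runqueue": sample.runqueue_ns,
        "iowait": sample.iowait_ns,
        "futex": sample.futex_wait_ns,
        "epoll": sample.epoll_wait_ns,
    }
    return max(waits, key=waits.get)
```

Filtering by process ID before any analysis is one simple way to express "track only the threads tied to your application"; a real collector would more likely apply that filter inside the kernel probe so unrelated threads never generate events.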

eBPF Indicator System Hierarchy (Diagram)

This diagram illustrates the hierarchical structure of the 16 eBPF metrics, organized by kernel subsystems, highlighting how they capture thread-resource interactions.

The Diagnostic Magic: Selective Tracking and Causal Inference

Diagnosis isn't about dumping data—it's about smart analysis. The paper outlines a streamlined workflow that aligns perfectly with our SRE Agent:

  1. Identify Entry Threads: Spot service-facing threads via socket or epoll waits.
  2. Trace Dependencies: Follow interactions (pipes, sockets, futexes) to build a chain of related threads.
  3. Detect Anomalies: Align thread metrics with business KPIs (e.g., P95 latency). Look for distribution shifts to flag bottlenecks.
  4. Infer Constraints: Trace back to shared resources causing blocks.
  5. Explain Everything: Output clear paths like "Thread X blocked by Thread Y for Z ms due to disk contention"—no black boxes, just verifiable causal chains.
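
Steps 1 and 2 above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's or the agent's implementation: the input shapes (a dict of per-thread wait times and a list of `(waiter, waker, resource)` tuples) and the function names `find_entry_threads` and `trace_dependencies` are assumptions made for the example.

```python
from collections import deque


def find_entry_threads(wait_ns_by_thread, min_wait_ns=1):
    """Step 1: treat threads that spend time in socket or epoll waits as entry threads.

    wait_ns_by_thread maps tid -> {"socket": ns, "epoll": ns, ...} (assumed shape).
    """
    return sorted(
        tid
        for tid, waits in wait_ns_by_thread.items()
        if waits.get("socket", 0) + waits.get("epoll", 0) >= min_wait_ns
    )


def trace_dependencies(entry_tids, interactions):
    """Step 2: breadth-first walk over observed interactions.

    interactions is a list of (waiter_tid, waker_tid, resource) tuples (assumed
    shape), e.g. thread 1 waited on a pipe that thread 2 wrote to.
    Returns the set of thread IDs reachable from the entry threads.
    """
    edges = {}
    for waiter, waker, resource in interactions:
        edges.setdefault(waiter, []).append((waker, resource))

    seen = set(entry_tids)
    queue = deque(entry_tids)
    while queue:
        tid = queue.popleft()
        for waker, _resource in edges.get(tid, []):
            if waker not in seen:
                seen.add(waker)
                queue.append(waker)
    return seen
```

The point of the traversal is selectivity: threads that never interact with a service-facing thread (directly or transitively) drop out of the analysis entirely.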

This "causal chain backtracking" leverages resource interactions over trace spans, making it more reliable and app-agnostic. Our agent enhances this with expert rules, ensuring diagnoses are not only accurate but also tailored to real-world SRE needs.
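
As a rough illustration of anomaly detection and backtracking (steps 3 to 5), the sketch below flags a shifted metric and then walks the blocking chain to produce readable explanations. Everything here is an assumption for the sake of example: the mean-shift test in `shifted` is a deliberately simple stand-in rather than the statistical method the paper uses, and the `blocked_by` structure and `backtrack` function are hypothetical.

```python
from statistics import mean, pstdev


def shifted(baseline, current, threshold=3.0):
    """Flag a metric whose current mean exceeds the baseline mean by more than
    `threshold` baseline standard deviations (a simple stand-in shift test)."""
    spread = pstdev(baseline) or 1.0  # avoid dividing by zero on a flat baseline
    return (mean(current) - mean(baseline)) / spread > threshold


def backtrack(entry_tid, blocked_by):
    """Follow the longest-blocking edge from the entry thread until no blocker remains.

    blocked_by maps tid -> list of (blocker_tid, resource, blocked_ms) (assumed shape).
    Returns one explanation line per hop in the chain.
    """
    path = []
    tid = entry_tid
    visited = {entry_tid}
    while blocked_by.get(tid):
        blocker, resource, ms = max(blocked_by[tid], key=lambda edge: edge[2])
        path.append(
            f"Thread {tid} blocked by Thread {blocker} for {ms} ms due to {resource}"
        )
        if blocker in visited:  # stop on a cycle, e.g. mutual lock waits
            break
        visited.add(blocker)
        tid = blocker
    return path
```

For example, with `blocked_by = {1: [(2, "futex", 40)], 2: [(7, "disk contention", 120)]}`, calling `backtrack(1, blocked_by)` yields a two-hop chain ending at thread 7, which has no recorded blocker and is therefore the candidate root cause.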

Diagnostic Workflow (Diagram)

This flowchart visualizes the step-by-step diagnostic process, emphasizing the causal chain from detection to explanation.

Proven in the Real World: Experiments That Deliver

The paper puts theory to the test across benchmarks like MySQL (mixed disk/lock issues), Redis (CPU bottlenecks), Kafka (external blocks), and TeaStore (microservice cascades). It reports accurate root-cause identification, fully explainable output, and low overhead (for example, about 0.3 ms added in Redis). Our own data collection closely matches the paper's, with some differences in how we analyze the data, which suggests the approach is not niche and can scale to real deployments.

Key takeaways echoing our vision:

  • Thread-Centric: Ditch process-level views for precise granularity.
  • Resource-Focused: Base everything on interactions, not assumptions.
  • Noise-Free: Track only what's relevant.
  • Explainable: Every insight has a traceable path.
  • Universal: Works across languages, systems, and architectures.

This isn't just validation—it's a blueprint for the next era of AIOps.

Experiment Results Overview (Diagram)

This bar chart summarizes the high accuracy rates from the paper's experiments, demonstrating the method's effectiveness in real scenarios.

Ready to Transform Your Ops? Try Our SRE Agent Today

In a product-led growth (PLG) world, we believe in letting the product speak for itself. Our SRE Agent is designed for seamless adoption: sign up, integrate via simple eBPF probes, and watch as it uncovers hidden issues in minutes. No steep learning curves and no vendor lock-in, just reliable, explainable diagnostics that save time and headaches. With this research supporting the approach, we're confident it will change how you handle performance woes.

Curious? Head to our platform, deploy a free trial, and see the difference. Because in SRE, it's not about chasing anomalies—it's about understanding your systems at their core. Let's build a more resilient future, one thread at a time.

References